Improved Estimation of High-dimensional Ising Models 
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Abstract 



We consider the problem of jointly estimat- 
ing the parameters as well as the structure 
of binary valued Markov Random Fields, in 
contrast to earlier work that focus on one of 
the two problems. We formulate the problem 
as a maximization of £i -regularized surrogate 
likelihood that allows us to find a sparse solu- 
tion. Our optimization technique efficiently 
incorporates the cutting-plane algorithm in 
order to obtain a tighter outer bound on the 
marginal polytope, which results in improve- 
ment of both parameter estimates and ap- 
proximation to marginals. On synthetic data, 
we compare our algorithm on the two estima- 
tion tasks to the other existing methods. We 
analyze the method in the high-dimensional 
setting, where the number of dimensions p is 
allowed to grow with the number of observa- 
tions n. The rate of convergence of the es- 
timate is demonstrated to depend explicitly 
on the sparsity of the underlying graph. 



1 Introduction 

Undirected graphical models, also known as Markov 
random fields (MRFs), have been successfully applied 
in a variety of domains, including natural language 
processing, computer vision, image analysis, spatial 
data analysis and statistical physics. In most of these 
domains, the structure of the graphical models is con- 
structed by hand. However, in certain complex do- 
mains we have little expertise about interactions be- 
tween features in data and we need a method that 
automatically selects a model that represents data 
well. We propose an algorithm that is able to learn 
a sparse model that fits data well and has an easily 
interpretable structure. 



Eric P. Xing 

School of Computer Science 
Carnegie Mellon University 
Pittsburgh, PA 15213 
epxing@cs.cmu.edu 



Let X = {Xi, . . . , Xp)^ be a random vector with dis- 
tribution P that can be represented by an undirected 
graph G = (V, E) . Each vertex from the set V is 
associated with one component of the random vector 
X. The edge set E of the graph G encodes certain 
conditional independence assumptions among subsets 
of the p-dimensional random vector X; Xi is condi- 
tionally independent of Xj given the other variables if 

i E. Let V = {a;« = {x["\ . . . ,4'^)}^=! be an 
i.i.d. sample. Our task is to estimate the edge set E as 
well as the parameters of the distribution that gener- 
ated the sample. The main contribution of our paper 
is twofold: the development of an efficient algorithm 
that estimates the undirected graphical model from 
data, and the high-dimensional asymptotic analysis of 
its estimates. We find that the rate of convergence of 
the estimate explicitly depends on the sparsity of the 
underlying graphical model. 

Under the assumption that X is Gaussian, estimation 
of the graph structure is equivalent to estimation of 
zeros in the inverse covariance matrix S^^. Several 
methods have been proposed that estimate the graph 
structure from V. [4] proposed a method that tests 
for partial correlations that are not significantly differ- 
ent from zero, which can be applied when p is small. 
In high dimensional setting, when p is large, the esti- 
mation is more complicated, but under the assump- 
tion that the graph is sparse, several methods can 
be employed successfully for structure recovery (e.g 
[10, 1, 7, 12]). If the random variable X is discrete, the 
problem of structure estimation becomes even harder 
as the likelihood cannot be optimized efficiently due 
to the problem of evaluation of the log-partition func- 
tion. Many recently proposed methods make use of £i 
regularization to learn sparse undirected models. [15] 
used a pseudo-likelihood approach, based on the local 
conditional likelihood at each node, which results in 
a consistent estimate of the structure, but not neces- 
sarily of the parameters. [9] proposed to optimize the 
£i penalized log-likelihood only over the set of active 



variables and itcrativcly enlarge the set until optinial- 
ity is achieved. However, the resulting graph structure 
is not necessarily sparse. [1] used the log-determinant 
relaxation of the log-partition function [19] to obtain 
a surrogate likelihood that can be easily optimized. 

In this paper we are interested in learning both the pa- 
rameters and the structure of a binary valued Markov 
Random Field with pairwise interactions from ob- 
served data. An important insight into the problem 
of structure estimation is that even if a sparse graph 
is estimated, that does not necessarily mean that the 
inference in the model is tractable (e.g. inference in 
a grid, in which each node has only 4 neighbors, is 
not tractable). Since we are interested in using the 
estimated model for inference, we use the insight ob- 
tained from [16]; i.e. we use the same approximate 
procedure for estimating the parameters and for in- 
ference, as the bias introduced in estimation phase 
can compensate for the error in the inference phase. 
Our method is mostly related to [1], as we propose to 
optimize the £i penalized surrogate likelihood based 
on the log-determinant relaxation. However, in order 
to obtain a better estimate, we efficiently incorporate 
the cutting-plane algorithm [13] for obtaining a tighter 
outer bound on the marginal polytope into our op- 
timization procedure. Using a better approximation 
to the log-partition function is crucial in reducing the 
approximation error of the estimate. In many appli- 
cations, the ambient dimensionality of the model p is 
larger than the sample size n and the classical asymp- 
totic analysis, where the model is fixed and the sam- 
ple size increases, docs not give a good insight into 
the model behavior. We analyze our estimate in the 
high-dimensional setting, i.e. we allow the dimension 
p to increase with the sample size n. Doing so, we are 
allowing a procedure to select a more complex model 
that can represent a larger class of distributions. 

2 Preliminciries 

In this section we briefly introduce MRFs. We present 
why the exact inference in discrete MRFs, in general, 
is intractable, and how to formulate approximate in- 
ference as a convex optimization problem. In Section 
3, we derive our learning algorithm using methods pre- 
sented here. 

Markov Random Fields. Consider an undirected 
graph G with vertex set V = {1,2, ... ,p} and edge 
set E C V X V. A Markov random field consists of 
a random vector X — {Xi, . . . , Xp)'^ E X'p , where the 
random variable Xs is associated with vertex s e V . 
In this paper we will consider binary pairwise MRFs, 
the Ising model, in which the probability distribu- 
tion factorizes as P(a;) = exp {{6, ^p{x)) — A{9)) and 



X e Here (6',^(.t)) denotes the dot prod- 

uct between the parameter vector 9 and poten- 

tials if{x) € W^, and A{e) = logE^^gA-p exp {{9, ^{x))) 
is the log-partition function. Since we are consider- 
ing binary pairwise MRFs, potentials are functions 
over nodes and edges of the form (p{xv) = Xy and 
ip{xu,Xy) = XuXy. For future use, we introduce a 
shorthand notation for the mean parameters rjy = 

E0[(f){Xy)] and r]yy' = E0[(f){Xy,Xy')]. 

Log-partition function. Evaluating the log- 
partition function involves summing over exponen- 
tially many terms and, in general, is intractable. The 
log-partition function can be expressed as an optimiza- 
tion problem, as a variational formulation, using its 
Fenchel-Legendre conjugate dual A*{r]): 



sup{{0,r]} 

rieM 



(1) 



where M := {?7 S M** ] 3p{X) s.t. r] = Ee[(l){x)]} is 
the set of realizable mean parameters rj and the dual 
function A*{r]) = — H [p[x; ■q{9)) is equal to the neg- 
ative entropy of the distribution parametrized by the 
mean parameters rj. For each parameter 9 there is a 
corresponding mean parameter rj G M that maximizes 
(1). The relation is given as: 



r] = VA{9) = Ee[cf>{x)]. 



(2) 



[18, 19] list other properties of log-partition function. 

Log-partition relaxations. The log-partition func- 
tion written in equation (1) defines an optimization 
problem restricted to the set M. . Since the set is a 
polytope, it can be represented as an intersection of a 
finite number of hyperplanes; however', the number of 
hyperplanes needed to describe the set M. grows expo- 
nentially with the number of nodes p. It is important 
to find an outer bound on M that can be easily char- 
acterized and as tight as possible. One outer bound 
can be obtained using the set of points that satisfy 
local consistency conditions LOCAL(G) := 



r^G [-1,1]'^ y{u,v)eE 



Vu+Vv <^ 
llv+Vu <1 



Vuv 

Vu+Vv - Vuv < 1 



By construction, we have Ai C LOCAL(G), but points 
in LOCAL(G) do not necessarily represent mean pa- 
rameters of any probability distribution. Another 
outer boimd can be obtained observing that the second 
moment matrix Mi (77) = Eg[{l x)'^{l x)] is positive 
semi-definite. Therefore we have M. C SDEFi(G) := 
{ry e I Mi{r]) >z O}. 

The outer bound on the polytope M can be further 
tightened by relating it to the cut polytope CUT(G) 
(e.g. [3]) and using known relaxations to the cut poly- 
tope. Mapping between M. and the cut polytope can 



be done using the suspension graph G' — {V',E') 
of the graph G, where V = V U {p + 1} and E' = 
E U + 1) \ V € V}. The suspension graph is cre- 
ated by adding an additional node p + 1 to the graph G 
and connecting each node v & V to the newly created 
node p + 1. 

Definition 1. LetWuv denote the weight of an edge in 

G' . The linear bijection ^cut that maps points rj ^ A4 
to points w G CUT(G") is given by Wy^n+i = ^iVv + 1) 
forv&V and Wuv = ^(1 - 'Huv) for {u, v) e E. 

Using a separation algorithm, it is possible to sepa- 
rate a class of inequalities that define the cut polytope 
and add them to the inequalities defining LOCAL(G) 
to tighten the outer bound. Efficient separation algo- 
rithms are known for several classes of inequalities [3] , 
but in this paper we will use the simplest one, cycle- 
inequalities. Cycle-inequalities can be written as: 



uveC\F 



+ V (1 - Wuv) > 1 



uveF 



(3) 



where C is a cycle in G' and FCC and \F\ is 
odd. The class of cycle-inequalities can be efficiently 
separated using Dijkstra's shortest path algorithm in 
0{n^ logn + n|£;|). 

To further relax the problem (1), we approximate the 
negative entropy A* (ry) by a function B* (77) to obtain 
the approximation B{9) to A{9): 

B{e)= sup {e,v)-B*{r]), (4) 

r,eOUT(G) 

where OUT(G) is a convex and compact set acting 
as an outer bound to A4. Even though it is not re- 
quired, many of the existing entropy approximations 
are strictly convex (e.g. the convexified Bethe entropy 
[17], the Gaussian-based log-determinant relaxation 
[19] or the rcweightcd Kikuchi approximations [20]), 
which guarantees uniqueness of the solution to (4). 

Inference in MRFs. The inference task in MRFs 
refers to finding or approximating the marginal prob- 
abilities (or the mean vector). It can be seen from 
equation (2) that the log-partition plays an important 
role in inference and a good approximation to it is es- 
sential in obtaining good estimates of the marginals. 
Many known approximate inference algorithms can be 
explained using the framework c;xplained above; they 
create an outer bound OUT(G) and use an entropy 
approximation to estimate the marginals (e.g. the log- 
determinant relaxation [19], Belief Propagation [21] 
and tree-reweighted sum-product (TRW) [17] to name 
few). Recent work proposed a cutting-plane algorithm 
[13] that iteratively tightens the outer bound on the 
polytope M using cycle-inequalities and empirically 
obtains improved estimates of marginals. 



3 Structure leeirning and pEirameter 
estimation 

In this section we address the problem of structure 
learning and parameter estimation. Given a sample 
V, we obtain our estimate of the edge set E and pa- 
rameters 9 associated with edges as a maximizer of the 
£1 penalized surrogate log-likelihood: 



^" = argmax e{9;V) - A" ^ \9u 



uv£E 



= argmax {9, r) - B{9) - A" ^ \9uv\, 

uveE 



(5) 



where fj" denotes the mean parameter estimated from 
the sample. A" is a tuning parameter that sets the 
strength of the penalty. Note that a solution to the 
problem (5) simultaneously gives the estimate of the 
edge set E and the parameter vector 9, since if 9uv = 
then the estimated graph does not contain the edge be- 
tween nodes u and v. Since the £1 penalty shrinks the 
edge parameters towards 0, the estimation is biased 
towards sparse models. 

In the remainder of this section we will explain our 
algorithm that efficiently solves (5). Since the prob- 
lem (5) is convex one could use, for example, the 
subgradient method for non-diferentiable functions [2] . 
However, computing the gradient of the log-likelihood 
'^^^QQ^^ = fj^ — Eg[Lp[x)\ requires inference over the 
model with current values of the parameters. Since 
the inference is only approximate, the accuracy of the 
computed gradient heavily depends on the approxi- 
mation used. Note again that the sparsity of the 
graph does not necessarily imply that the inference 
is tractable. Applying the cutting-plane algorithm [13] 
for approximate inference could achieve good accuracy, 
however it would be computationally prohibitive, since 
at each iteration of the subgradient algorithm, the cut- 
ting plane algorithm have to be run anew to obtain the 
mean parameters. Hence, computational inefficiency 
stems from the fact that for each parameter 9, we 
have to compute the corresponding mean parameter 
T}. We present a way to exploit the structure of the 
log-determinant relaxation to obtain both 9 and 77 and 
obtain computationally efficient algorithm. We will 
use the convenient way of representing parameters in 
a matrix: 



/O 



R{9) 



9 



\9p 9ip 92p 



9ip 



Algorithm 1 describes our proposed method for struc- 
ture learning. To obtain the solution of (5) efficiently. 



Algorithm 1 Structure learning with cutting-plane 
algorithm 

1: OUT(G) ^ LOCAL(G) 

2: repeat 

6. ,7]^ max^l max,gouT(G) {{0,v) - B*{v)} ^ 
4: w ^ Create_Suspension_Graph(r7) 
5: C ^ Separate_CycleJnequalities(ti;) 
6: OUT(G) ^OUT(G)nC 
7: until C = 



Algorithm 2 Finding best parameters 
1: W, a <— Initialize() 
2: repeat 

3- maxw / logdet(ty + + diag(m)) 1 

W gW { defined in equation (9) ) 

4: Y ^{W + R{r) + dia.g{m))-^ ^J2^(^^^^ 
5: a maXa>o {log det(y + J2i ctiA) - cJ'b} 
6: until stop criterion 



we use variational representation of B{9) using the log- 
determinant relaxation and jointly optimize over 6 and 
rj. Starting with LOCAL(G) as an outer bound to M, 
the algorithm alternates between finding the best pa- 
rameters and tightening the outer bound OUT{G) by 
incorporating the cycle inequalities that are violated 
by the current mean parameters rj. The algorithm is 
similar, in spirit, to the cutting-plane algorithm [13] 
for inference. However, optimization in line 3 of Al- 
gorithm 1 is done over all parameters jointly, which 
produces some technical challenges. We proceed with 
a procedure for solving the optimization problem (5). 

The formulation in lino 3 arose from using the varia- 
tional form (4) of B{9) in the surrogate likelihood (5). 
The idea behind the log-determinant relaxation [19] is 
to upper bound the log-partition function using the 
Gaussian-based entropy approximation: 

A{e) < sup i logdet(ii(77) + diag(m)) -|- {6, rj), 

?7eOUT(G) 2 

where m G W^'^ = (1, |, . . . , |). To make use of this 
upper bound, we have to rewrite it so that both pa- 
rameters 9 and t] can be extracted from it. Before we 
rewrite the upper bound, notice that after k iterations 
of repeat- until loop in Algorithm 1, we have added k 
cycle-inequalities. It will be useful to rewrite equation 
(3) in a matrix form as 'Tr{AkR(ri)) > bk, where Ak is 
a symmetric matrix for the A;-th inequality. 

Lemma 2. After k iterations of algorithm, we write 
the log-partition relaxation as 



B{e) = f log(^) -hp+l)-l max{i.^m - 

Z Z Z Z u,a.>Q 



+ logdet(-J^(^) - diag(z/) + ^ UiAi)}. 



(6) 



To obtain the mean vector rj* corresponding to 9 we 
take off diagonal elements of the matrix 

Z = {-R{9) - diag(j.) + ^aiA^)-^ 



defined for optimal v, a, i.e. R{ri*) = Z — diag(Z). 



Proof. Due to the lack of space, we leave out technical 
details of the proof and just give a sketch of the main 
idea. The proof is similar to the proof of Lemma 5 in 
[1]. We start from equation (4), where the Gaussian- 
based entropy is used as B*{r]), and rewrite it in the 
Lagrangian form with as a Lagrangian multiplier for 
i-th cycle-inequality. The lemma follows from rewrit- 
ing the Lagrangian in the dual form. □ 

Using Lemma 2, we can rewrite the problem in line 3 
of Algorithm 1 so that both 9 and rj can be extracted. 
Defining Y := — — diag(i^) and dropping constant 
terms the optimization problem is written as: 

-TV(y(il(r;")+diag(m)))+ 
"^f^ \ log det (y + aiAi) -a-^b-X" \Yuv \ 

_ , (7) 
With a = 0, the problem (7) is identical to the problem 
analysed in [1, 7], where Y can be found using a block 
coordinate descent. However, due to the terms that 
arise from the added cycle-inequalities, the Lagrangian 
multipliers a 7^ are different from zero and we need 
a different method to obtain the optimal Y and a. 

We propose Algorithm 2 for solving the problem (7). 
The algorithm iterate between solving Y and a. For 
fixed a, the dual of (7) is given as 



max • 



where 



W = 



logdet(H^ + i?(r7") + diag(TO))- 
TriWiEiaiA,)) 



Wuv = iov u = 1, V = 1, u = V 
\Wuv\ < A" otherwise 



(8) 



(9) 



To solve for Y we apply the subgradient method, in 
which we optimize the dual variable W with the pro- 
jection step onto the box constraint defined in (9). 
The gradient direction is given as VW = {R{fi'^) + 
diag(m) -|- W)~^ — aiAi and the step size can be 
defined as 7* = jo/Vt- For fixed Y, the problem (7) 
reduces to one in line 5 of Algorithm 2. Since the 
dimension of a, corresponding to the number of cycle- 
inequalities, is not large, we can apply any optimiza- 
tion method to find an optimal a. 



Algorithm 2 is guaranteed to eonverge. wliieli can be 
shown from the standard results for block coordinate 
optimization (e.g [14]). Using the same results, we 
could perform the analysis of the convergence rate, but 
that is beyond the scope of this paper. The compu- 
tational complexity of Algorithm 2 is roughly 0{p^), 
so the overall complexity of the structure learning is 
0{p^). The method has lower complexity than the 
method [1] and same as graphical lasso [7]. A use- 
ful trick for boosting the performance of the structure 
learning algorithm is to use a warm-start strategy for 
Algorithm 2, i.e. initialize Y and a to the optimal so- 
lution of the previous iteration. 

An important advantage of our algorithm, over other 
block-coordinate descent algorithms [1, 7], is that we 
can readily modify it to the case when the regulariza- 
tion is given as a constraint on £i ball in which param- 
eters lie, e.g. \&uv\ < C- In that case, we can 
solve directly the primal problem, to obtain Y, using 
an efficient algorithm for projection onto the £i ball [6]. 
The algorithm for efficient projection onto the £i ball 
was also exploited for learning sparse inverse covari- 
ance matrices under the Gaussian assumption on data 
[5]. Finally, the structure learning algorithm could be 
extended to the non-binary MRFs using a generaliza- 
tion to the log-determinant relaxation [19] and project- 
ing the marginal polytope to different binary marginal 
polytopes [13]. 



4 Asymptotic analysis 

In this section, we state our main theoretical result on 
the convergence rate of the £i penalized log-likelihood 
estimate. As opposed to the algorithm described in 
the last section, where we used the log-determinant ap- 
proximation, our theoretical result is applicable to any 
strongly convex surrogate for the log-partition func- 
tion. The asymptotic analysis presented here is high- 
dimensional in nature, i.e. we analyse the estimate in 
the case when both the model dimension p and the 
sample size n tend to infinity. Traditionally, asymp- 
totic analysis is performed for a fixed model letting the 
sample size n to increase, however, that type of anal- 
ysis does not reflect the situation that occurs in many 
real data sets where the dimensionality p is larger than 
the sample size n (e.g. gene arrays, flXlRIs). To get in- 
sight into the behaviour of the estimate, it is therefore 
important to perform the high-dimensional analysis. 

The first question to ask is whether the estimate 6"" 
converges to 9* , the true parameter associated with the 
distribution? Unfortunately, in general, the answer to 
this question is no, since we are using an approxima- 
tion to the log-partition function. Note that we could 
obtain such consistency result if we are willing to as- 



sume that the true graph can be foimd in a restricted 
class of models, e.g. trees for which the approximation 
will give the exact solution. 

The next best thing we can hope for is that our pro- 
cedure produces an estimate that is close to the best 
parameter 6 in the class using the surrogate B{0). In 
general, we cannot tell how far is the best parameter 
in the class from the true 9*, however, letting the 
dimension of the model p to increase with the size of 
the sample and using a good approximation B{6) we 
are able to represent an increasing number of distribu- 
tions and, hence, reduce the size of the approximation 
error. 

Naturally, the next question is whether the estimate 
6*" at least converges to 9? This convergence would be 
obviously true if the model had been fixed, however, we 
are dealing with models of increasing dimensionality. 
In order to guarantee the closeness of the estimate 9" 
to the best parameter 9 we will have to assume that 
the model is sparse and we will show how the rate 
of convergence explicitly depends on the sparsity. We 
proceed with the main result. 

Let rj* be the true mean vector corresponding to 9*. 
Since we use strictly convex conjugate dual pair B 
and B* , the gradient mapping Vi? is one-to-one and 
onto the relative interior of the constrain set OUT(G') 
[11]. In the limit of infinite amount of data, the 
asymptotic value of the parameter estimate is given 
by ^ = V-^B{r]*). Let S = {P \ 9/3 ^ 0} he the index 
set of non zero elements of 9 and let S be its comple- 
ment. Note that the set S indexes nodes and edges 
in the graph and that the size of the set is related to 
the sparsity of the estimated graph. Denote 8=151 
the number of the non-zero elements. The following 
theorem gives us the asymptotic behaviour of the pa- 
rameter estimate To prove the theorem we follow 
a method of Rothman et al. [12]. 

Theorem 3. Let 9^ he the minimizer of (5). If B{6) 

is strongly convex and A„ x \f^^ then, 



= Op 



'{p + s) \ogp 



Proof. Let G : R'' R be a map defined as G{5) := 

£{9+5;V)-\^Y.uv\^uv + 5uv\-mV) + \^Y.uv\{uv\. 
Using the Taylor expansion and the fact that VB{9) = 
ri*, wehave B (9 + 6)- B {9) = {ri* ,6) + ^6'^\7'^[B{e + 
aS)]S, for a € [0, 1]. Now, we may write G as: 

1 



G{6) = {d, - r?*) - -5' W'[B{9 + a5)]6 



— A" ^^(l^ti-y' + 5vv' \ ~ \9v 



(10) 



By construction, our estimate = 6" — 9 rnaxirnizes G 
and we have G((5") > G(0) = 0. The proof continues 



T,vv'es\^^^''\ ^ V^(EweS'^l')^ ^he upper bound 



by showing that for some L > and 



= L we 



have G{5) < 0, which impHcs that \\6"\\2 < L because 
of concavity of G. Appropriately choosing L, we show 
that S converges to 0. 

We proceed by bounding each term in (10). The first 
term can be written as: 

\{s,r-v')\<\ ^nv{fj:i,-v:v)\ + \Ys4fi:-v:)\ 

uvGE veV 
— (*) + (**)• 

(11) 

Using the union sum inequality and Hoeffding's in- 
equality, with probability tending to 1, we have 



logp 



and a bound (*) < Gi^ '-^J2^JSuv\. Using the 
Cauchy-Schwartz inequality and Hoeffding's inequal- 
ity the second term is bounded as 

V V 
V 

The Hessian of a strongly convex function is positive 
definite, so the second term in (10) can be bound as 
follows Tr{5^V'^B{0 + a6)5) > G3 \\5\\l, where C3 is 
a constant that depends on the minimum eigenvalue 
of the Hessian. Using the triangular inequality on the 
third term, we have 

'Y^{\Ovv' + 5vv'\ - \6vv'\) > {'^ \5vv'\ - ^ \5vv'\)- 

vv' vv'es vv'es 

Now we can define the constant L as. 



L = M 



.slogp plogp 



0. 



Taking A" = ^^f^^t we have an upper bound on 
G: 



uves 



uves 



First term is negative for small e so we can re- 
move it from the upper bound. Using the fact that 



becomes G(^<5) < (^^ - Cs)iEr., ^1' 
G3){J2v ^v)^ < for sufHciently large M and the the- 
orem follows. □ 



5 Experimental Results 

In this section we compare the performance of our es- 
timation method for sparse graphs as well as Bancr- 
jee et al. [1] and Wainwright et al. [15] on simulated 
data. In order to assess the performance we compare 
speed, accuracy of the structure selection and accuracy 
of the estimated parameters. Method of Wainwright 
et al. [15] produces two estimates 9uv and 6vu for each 
edge parameter. There are two ways how we can sym- 
metrize the solution: 



and 



V 

Ovu if 



if 1^™! > 1^ V 
Ovu if \Ouv\ < 1^ 



"Wainwright_min" , 



"Wainwright_max" . 



We have implemented the method of Wainwright et al. 
[15] using a coordinate descent algorithm for logistic 
regression [8]. To compare the method of Banerjce et 
al. [1] we used "COVSEL" package available from the 
authors website. 

We use three types of graphs for experiments: (a) 

4- nearest- neighbor grid models, (b) sparse random 
graphs, and (c) sparse random graphs with dense sub- 
graphs. The grid model represents a simple sparse 
graph, which does not allow for exact inference due 
to the large tree width. To create a random sparse 
graph, we choose a total number of edges and add 
them between random pairs of nodes, taking into an 
account the maximum node degree. Random graphs 
with dense subgraphs are globally sparse, i.e. they have 
few edges, however there are local subgraphs that are 
very dense, with strong interactions between nodes. 
To generate them, wc first create dense components 
and then randomly add edges between different com- 
ponents. For a given graph, we assign each node a 
parameter 9^ ~ Z^[— 1,1] and each edge a parameter 
9uv ^ W[— ^,^], where ^ is the coupling strength. For 
a given distribution Pg. of the model wc generate ran- 
dom data sets a;*-^-*, . . . ,a;("^} of i.i.d. points us- 
ing Gibbs sampling. Every experimental result is av- 
eraged over 50 runs. 

We first comment on the speed of convergence of the 
methods. We have decided not to plot graphs with 
speed comparisons due to the use of different program- 
ming languages, however, from our limited experience 



wc observe that the method of Wainwright ot ah [15] 
is the fastest and the running time does not depend 
much on the underlying sparsity of the graph. Our 
method compared favorably to the method of Bancr- 
jee et al. [1], however, increasing the density of the 
graph or increasing the coupling strength ^ resulted in 
an increase of the number of added cycle-inequalities 
needed for the approximation and in a slower conver- 
gence. From the comments on the speed of the meth- 
ods, we can suspect that there is a trade-off between 
speed and accuracy. 

Next, we compare the accuracy of the edge selection. 
For this experiment we create a random sparse graphs 
with p = 50 nodes and 100 edges, such that the cou- 
pling strength is ^ = 0.5. Then we vary the sam- 
ple size n from 100 to 1000. Figure 1 shows preci- 
sion and recall of edges included into the graph for a 

regularization parameter set as A„ = 2 * \f^^^- We 
have excluded the method of Banerjee et al. [1] in the 
figure, since for this experiment the estimated struc- 
ture was identical. From plots we can see that our 
method and "Wainwright_min" produce similar results 
and perform better than "Wainwrightjnax" . Similar 
conclusions can be drawn for the grid model (results 
not shown). To shed some more light on these results, 
it is instructive to discuss the speed at which the edges 
get included into the model as the function of param- 
eter A„. "Wainwright_max" produces estimates that 
include edges the fastest and the resulting graph is 
the densest which can explain low precision due to 
many spurious edges that get included into the model. 
"Wainwright_min" includes edges most conservatively, 
while our solution produces graphs that according to 
their denseness fall in between "Wainwright_min" and 
"Wainwright-max" estimates. Next, we move onto a 
harder problem of estimating the structure of graphs 
that have strong couplings between nodes. For this 
experiment, we construct a random graph with two 
dense subgraphs, which are fully connected graphs of 
size 8. Graphs have total of j? = 50 nodes and 100 
edges. We vary the coupling strength ^ in interval 
[1, 10]. The structure is estimated from a sample size 
of n = 500 and the results are presented in Figure 2. 
In this experiment we start to notice a difference be- 
tween the estimated models using our method and the 
method [1]. As the coupling strength increases, the 
mean parameter is not captured within the constraints 
defined by the cycle-inequalities and our method has 
to add the violated cycle-inequalities to the constraint 
set OUT(G) which produces a different estimate. 

Final set of experiments measures how well the esti- 
mated model fits the observed data and for that pur- 
pose we use surrogate log-likelihood. While the true 
measure of the fit should be the log-likelihood, it is 
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Figure 1: Comparing edge recovery of a sparse random 
graph: p = 50,^ = 0.5, Hedges = 100. 
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Figure 2: Comparing edge recovery of a graph with 
dense subgraph: p = 50, n = 500, Hedges = 100. 



not possible to use it due to the problem of evaluating 
the log-partition function. Furthermore, in practise we 
always use the surrogate log-likelihood when choosing 
a model from data, so our experiments can be justi- 
fied. Figure 3 shows the fit for data generated from 
the grid model. As in estimation of the structure, 
for this sparse model, there is no difference between 
our method and the method of Banerjee et al. [1]. 
One explanation of this result is that the outer bound 
on the polytope implicitly defined through log- 
determinant that act as a log-barrier, captures the es- 
timated mean parameter and all the cycle-inequalities 
are satisfied. Next, similarly to the structure estima- 
tion, we generate a graph with dense subgraphs and 
present results in Figure 3. Again, we can see that 
our method start to perform better as we increase the 
strength of couplings. On Figure 3 wc also plot the fit 
for parameters estimated from the method of Wain- 
wright et al. [15], however, since the method uses a 
different pseudo-likelihood, it performs worse than the 
other two methods. 

To summarize, we have shown that on tlic task of es- 
timating the structure OTir method performs similarly 
to the method of Banerjee et al. [1] and that both 
methods estimate structure that falls between "Wain- 
wright_min" and "Wainwright_max" estimates. When 
measuring how well do learned models fit the data, we 
observe that our method outperforms the method of 
Banerjee et al. [1] when the model has dense subgraphs 
and strong interactions between nodes. 
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Figure 3: Test surrogate log-likelihood. Ouv € [— ^,^]- 

(a) Sparse grid p = 50, n = 500; (b) Graph with dense 
subgraphs p = 50, n = 200, Nedges = 100. 

6 Conclusion 

In the paper we have presented a method for jointly 
learning the structure and the parameters of an undi- 
rected MRF from the data. Our method is useful 
when there are strong correlations between nodes in 
the graph or when the true graph is not too sparse. If 
the true graph is very sparse, our algorithm efficiently 
finds the estimate without performing the cutting- 
plane step. We showed how to incorporate the class 
of cyc;k;-inequalities into the algorithm, which are par- 
ticularly valuable as they can be separated efBciently 
and added as needed to improve the solution. Further- 
more, our algorithm can be efficiently combined with 
the algorithm for projection onto the ii ball in cases 
when the parameters are constrained to lie in the ii 
ball. 

We have analyzed convergence rate of an estimate 
based on maximizing a penalized surrogate likelihood 
in high-dimensional settings. We have given a rate 
that depends on the number of non-zero elements of 
the best parameter for the surrogate likelihood. Such 
analysis provides insight into the performance of the 
method in high-dimensional setting, when both p and 
n are allowed to grow. 
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